Hierarchical Non-Emitting Markov Models

Eric Sven Ristad Robert G. Thomas 

Department of Computer Science 

Princeton University 
Research Report CS-TR-544-97 
May 1997; Revised January 1998 



Abstract 

We describe a simple variant of the interpolated Markov model with non- 
emitting state transitions and prove that it is strictly more powerful than any 
Markov model. More importantly, the non-emitting model outperforms the clas- 
sic interpolated model on natural language texts under a wide range of experi- 
mental conditions, with only a modest increase in computational requirements. 
The non-emitting model is also much less prone to overfitting. 

1 Introduction



The Markov model has long been the core technology of statistical language 
modeling. Many other models have been proposed, but none has offered a better 
combination of predictive performance, computational efficiency, and ease of 
implementation. Here we add hierarchical non-emitting state transitions to the 
Markov model. Although the states in our model remain Markovian, the model 
itself is no longer Markovian because it can represent unbounded dependencies 
in the state order distribution. Consequently, the non-emitting Markov model is 
strictly more powerful than any Markov model, including the context model [19],
[20], |2q|, the backoff model [13], and the interpolated Markov model [12], [14].
More importantly, the non-emitting model consistently outperforms the best 
Markov models on natural language texts, under a wide range of experimental 
conditions. The non-emitting model is also nearly as computationally efficient 
and easy to implement as the interpolated Markov model. 

The remainder of our report consists of five sections and one appendix. In
section 2, we motivate the fundamental problem of time series prediction, which is
to combine the probabilities of events of different orders. Section 3 reviews the
interpolated Markov model and briefly demonstrates the equivalence of interpolated
models and basic Markov models of the same model order. Next, we
introduce the hierarchical non-emitting Markov model in section 4 and prove
that even a second order non-emitting model is strictly more powerful than any
Markov model, of any model order. Section 5 provides efficient algorithms to
optimize the parameters of a non-emitting model on data. In section 6 we report
empirical results for the interpolated model and the non-emitting model on
the Brown corpus and Wall Street Journal. Finally, in section 7 we conjecture
that the non-emitting model excels empirically because it imposes a pseudo-Bayesian
discipline on maximum likelihood techniques. Appendix A reviews
the backoff model and explains how to construct a non-emitting backoff model
that is strictly more powerful than any backoff model.

Our notation is as follows. Let $A$ be a finite alphabet of distinct symbols,
$|A| = k$, and let $x^T \in A^T$ denote an arbitrary string of length $T$ over the
alphabet $A$. Then $x_i^j$ denotes the substring of $x^T$ that begins at position $i$ and
ends at position $j$. For convenience, we abbreviate the unit length substring $x_i^i$
as $x_i$ and the length $t$ prefix of $x^T$ as $x^t$.



2 Time Series Prediction 

A time series model must assign accurate probabilities to strings of unbounded 
length. Yet unbounded strings don't occur in recorded histories, which are 
always finite. Therefore, to estimate the probabilities of unbounded strings 
from a finite corpus, we must assume that each symbol in a given string depends 
only on a finite number of (equivalence classes of) contexts. The most widely 
adopted independence assumption is the order n Markov assumption, which
states that each symbol depends only on the immediately preceding n symbols, 
and is conditionally independent of the distant past. 

$$p(x^T \mid T) = \prod_{t=1}^{T} p(x_t \mid x_{t-n}^{t-1})$$

The simplest statistical model to incorporate an order n Markov assumption
is the basic Markov model. A basic Markov model $\phi = (A, n, \delta_n)$ consists of
an alphabet $A$, a model order $n$, $n > 0$, and the state transition probabilities
$\delta_n : A^n \times A \to [0, 1]$. With probability $\delta_n(y|x^n)$, a Markov model in the state $x^n$
will emit the symbol $y$ and transition to the state $x_2^n y$. Therefore, the probability
$p_m(x_t|x^{t-1}, \phi)$ assigned by an order $n$ basic Markov model $\phi$ to a symbol $x_t$ in
the history $x^{t-1}$ depends only on the last $n$ symbols of the history.

$$p_m(x_t \mid x^{t-1}, \phi) = \delta_n(x_t \mid x_{t-n}^{t-1}) \qquad (1)$$

Since the Markov model contains only a finite number of parameters, it is in 
principle possible to estimate their values directly from data. All that remains 
is to choose the model order. 

In real-world time series problems, the future depends on the entire past, 
even if only weakly. In order to more closely approximate a real-world source, 
we would like our model order to be as large as possible. Yet we have only a 
finite amount of training data from which to estimate our model parameters. 
An order n Markov model over an alphabet of k symbols has $k^{n+1}$ events,
while a corpus of length T has at most $T - n$ distinct events of order n. The
exponential growth in events quickly exceeds the size of all available training 
data, and nearly all the higher-order events do not occur in the training data. 
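
The gap between possible and observed events can be checked directly on any corpus. The following is a minimal sketch, not part of the original report: the function name `event_coverage` and the toy corpus are illustrative assumptions, and an order n event is taken to be a substring of length n + 1 (an n symbol context plus the symbol that follows it).

```python
from typing import Tuple


def event_coverage(corpus: str, n: int, k: int) -> Tuple[int, int]:
    """Return (possible, observed) counts of order-n events.

    possible: k ** (n + 1), the number of distinct events an order-n model
              over k symbols must account for.
    observed: number of distinct length-(n + 1) substrings in the corpus,
              which can never exceed len(corpus) - n.
    """
    possible = k ** (n + 1)
    observed = len({corpus[i:i + n + 1] for i in range(len(corpus) - n)})
    return possible, observed


if __name__ == "__main__":
    text = "abracadabra"      # toy corpus (assumption), T = 11
    k = len(set(text))        # alphabet size of the toy corpus
    for n in range(4):
        possible, observed = event_coverage(text, n, k)
        print(f"order {n}: {observed} of {possible} events observed")
```

Even on this toy string the number of possible events grows by a factor of k with each order, while the number of observed events is capped by the corpus length.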

This tension between model complexity and data sparsity is fundamental to 
time series modeling. The probabilities of the lower order events can be more 
accurately estimated from the available training data, while the higher order 
events are better able to model complex real-world sources. An effective model, 
then, must include individual events of both higher and lower orders. 

The two most widely-used techniques for combining individual events of 
varying orders are backoff and interpolation. In an interpolated model, the 
transition probabilities from lower and higher order states are combined stochas- 
tically using mixing parameters. In a backoff model, the event probabilities are 
combined according to a partial order which typically favors higher order events 
over lower order events. In section 3 and appendix A, we show that backoff
models and interpolated models are formally equivalent to basic Markov
models. Therefore, backoff and interpolation are simply parameter estimation 
schemes for basic Markov models. 

3 Interpolation



Here we introduce the interpolated Markov model and explain why the interpolated
model class is equivalent to the class of basic Markov models. In the next
section, we introduce hierarchical non-emitting state transitions to the Markov
model, and prove that the new non-emitting models are no longer Markovian
even though their states are.

In the interpolated Markov model, the transition probabilities from states of 
different orders are combined using state-conditional mixing parameters. The 
mixing parameters smooth the transition probabilities from higher order states 
with those from lower order states [12]. Mixing the transition probabilities
from states of different orders results in more accurate predictions than can be 
obtained from any fixed model order. 

Formally, an interpolated Markov model $\phi = (A, n, \delta, \lambda)$ consists of a finite
alphabet $A$, a maximal model order $n$, the state transition probabilities $\delta =
\delta_0 \ldots \delta_n$, $\delta_i : A^i \times A \to [0,1]$, and the state-conditional interpolation parameters
$\lambda : A^n \times [0,n] \to [0,1]$. The state order is a hidden variable. The probability
assigned by an interpolated model is a linear combination of the probabilities
assigned by all the lower order Markov models.

$$p_c(y \mid x^n, \phi) = \sum_{i=0}^{n} \lambda(i \mid x^n)\, \delta_i(y \mid x_{n-i+1}^{n}) \qquad (2)$$

An interpolated model is a valid probability model if every $\delta_i(\cdot|x^i)$ and every
$\lambda(\cdot|x^n)$ is valid. It is nonzero for all strings $A^*$ if $\delta_0(\cdot)$ is strictly positive for all
symbols $A$ and no $\lambda(i|x^n)$ is unity when $\delta_i(\cdot|x^i)$ is zero for some symbol.

Estimating the $O(nk^n)$ state interpolation probabilities is considerably easier
than estimating the $O(k^{n+1})$ state transition probabilities in an order n Markov
model. To begin with, we set $\lambda(i|x^n)$ to 0 if the order $i$ state $x^i$ is novel. Now
we need only to estimate the $O(nT)$ interpolation parameters that have been
observed in the training data.

Nonetheless, there are still too many interpolation parameters to be ac- 
curately estimated. Further refinements are necessary to improve predictive 
performance. One refinement is to group similar parameters into equivalence 
classes and then constrain them to take the same values. This is called param- 
eter tying. At one extreme, each state-conditional interpolation distribution is 
its own equivalence class. At the other extreme, all interpolation probabilities 
are tied together and we have the state-independent interpolated Markov model 

$$p_c(y \mid x^n, \phi) = \sum_{i=0}^{n} \lambda_i\, \delta_i(y \mid x_{n-i+1}^{n}) \qquad (3)$$

with only n + 1 interpolation parameters. While parameter tying can improve 
performance, reducing state-conditional interpolation to state-independent in- 
terpolation results in poor performance. 

A hierarchical parameterization of the full state-conditional interpolation is
more effective. Let $\lambda_i : A^i \to [0, 1]$ be the set of $i$th order state interpolation
parameters, where $\lambda_i(x^i)$ is the probability of using the $i$th order state transition
probability $\delta_i(\cdot|x^i)$, conditioned on the decision not to use any higher order state
transition probability.

$$\lambda(i \mid x^n) = \lambda_i(x_{n-i+1}^{n}) \prod_{j=i+1}^{n} \left(1 - \lambda_j(x_{n-j+1}^{n})\right)$$

Then the probability $p_c(y|x^n, \phi)$ that the state $x^n$ will emit the symbol $y$ has a
particularly simple form

$$p_c(y \mid x^i, \phi) = \lambda_i(x^i)\, \delta_i(y \mid x^i) + \left(1 - \lambda_i(x^i)\right) p_c(y \mid x_2^i, \phi) \qquad (4)$$

where $\lambda_i(x^i) = 0$ for $i > n$, and therefore $p_c(x_t|x^{t-1}, \phi) = p_c(x_t|x_{t-n}^{t-1}, \phi)$, i.e.,
the prediction depends only on the last $n$ symbols of the history.
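
The recursion for hierarchical interpolation translates almost line for line into code. The sketch below is illustrative only: `make_toy_model` is a hypothetical helper that builds relative-frequency transition probabilities from a toy string and assigns a single constant mixing weight to every observed state (novel states get weight 0, as described above). It also assumes that the order 0 state always emits, which terminates the recursion.

```python
from collections import Counter
from typing import Callable, List, Tuple


def make_toy_model(train: str, n: int, lam_value: float = 0.5
                   ) -> Tuple[List[str], Callable, Callable]:
    """Hypothetical helper: relative-frequency delta and a constant lambda."""
    alphabet = sorted(set(train))
    counts = Counter(train[i:i + m]
                     for m in range(1, n + 2)
                     for i in range(len(train) - m + 1))

    def context_count(state: str) -> int:
        # Number of times the state was followed by some symbol.
        return sum(counts[state + a] for a in alphabet)

    def delta(y: str, state: str) -> float:
        c = context_count(state)
        return counts[state + y] / c if c else 0.0

    def lam(state: str) -> float:
        # Novel states get weight 0, so they always defer to lower orders.
        return lam_value if context_count(state) else 0.0

    return alphabet, delta, lam


def interpolated_prob(y: str, context: str, n: int,
                      delta: Callable, lam: Callable) -> float:
    """p_c(y | context): mix the order-i estimate with the lower-order result."""
    state = context[-n:] if n > 0 else ""   # only the last n symbols matter

    def recurse(s: str) -> float:
        if s == "":
            return delta(y, "")             # assumed: order 0 always emits
        weight = lam(s)
        return weight * delta(y, s) + (1.0 - weight) * recurse(s[1:])

    return recurse(state)
```

Because the state is truncated to the last n symbols before the recursion starts, two histories that share their last n symbols always receive the same prediction.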

A quick glance at the form of (2) and (1) reveals the fundamental simplicity
of the interpolated Markov model. Every interpolated model is equivalent to
a basic Markov model of the same order, and every basic Markov model is an
interpolated model of the same order. We may convert an interpolated model
$\phi$ into a basic model $\phi'$ of the same model order $n$, simply by setting $\delta'_n(y|x^n)$
equal to $p_c(y|x^n, \phi)$ for all states $x^n \in A^n$ and symbols $y \in A$. Thus, the class
of interpolated Markov models is extensionally equivalent to the class of basic
Markov models.



4 Non-Emitting Transitions 

In the previous section, we explained how to combine events of varying orders 
using interpolation and backoff. Interpolation and backoff both use the probabilities
of lower order events to estimate the probabilities of higher order events. As
a result, interpolated and backoff models are extensionally equivalent to each 
other and to basic Markov models of the same order. In this section, we explain 
how to combine events of varying orders using non-emitting state transitions. 

The central idea is to allow actual non-emitting transitions between events of 
different orders. Unlike interpolation and backoff, non-emitting transitions are 
not merely an estimation method - they actually increase the expressive power 
of the model class. As a result, non-emitting models are strictly more powerful 
than the class of basic Markov models. Section 5 provides efficient
algorithms to evaluate the probability of a string according to a non-emitting 
model and to optimize the parameters of a non-emitting model on data. 

A non-emitting mixture Markov model $\phi = (A, n, \delta, \lambda)$ consists of a finite
alphabet $A$, a maximal model order $n$, the emitting state transition probabilities
$\delta_i : A^i \times A \to [0,1]$, and the non-emitting state transition probabilities $\lambda_i :
A^i \times [0, n] \to [0,1]$. The non-emitting model alternates between non-emitting
and emitting transitions according to the $\lambda$ and $\delta$ parameters, respectively. The
parameter $\lambda(j|x^i)$ specifies the probability that the model will transition from
the state $x^i$ to the order $j$ state $x_{i-j+1}^{i}$ without emitting a symbol. The parameter $\delta_j(y|x^j)$
specifies the probability that the model will emit the symbol $y$ from the state
$x^j$ and transition to the successor state $x^j y$. Then the probability $p_\epsilon(y^j|x^i, \phi)$
assigned to a string $y^j$ in the state $x^i$ has the form

$$p_\epsilon(y^j \mid x^i, \phi) = \sum_{l=0}^{i} \lambda(l \mid x^i)\, \delta_l(y_1 \mid x_{i-l+1}^{i})\, p_\epsilon(y_2^j \mid x_{i-l+1}^{i} y_1, \phi) \qquad (5)$$

When the model order is sufficiently high, then a hierarchical parameterization
of the non-emitting transition probabilities may improve performance.
With probability $1 - \lambda_i(x^i)$, a hierarchical non-emitting model will transition
from the state $x^i$ to the state $x_2^i$ without emitting a symbol. With probability
$\lambda_i(x^i)\delta_i(y|x^i)$, the model will transition from the state $x^i$ to the state $x^i y$ and
emit the symbol $y$.

Therefore, the probability $p_\epsilon(y^j|x^i, \phi)$ assigned to a string $y^j$ in the history
$x^i$ by a hierarchical non-emitting model has the recursive form (6),

$$p_\epsilon(y^j \mid x^i, \phi) = \lambda_i(x^i)\, \delta_i(y_1 \mid x^i)\, p_\epsilon(y_2^j \mid x^i y_1, \phi) + \left(1 - \lambda_i(x^i)\right) p_\epsilon(y^j \mid x_2^i, \phi) \qquad (6)$$

where $\lambda_i(x^i) = 0$ for $i > n$ and $\lambda_0(\epsilon) = 1$. Note that, unlike the basic Markov
model, $p_\epsilon(x_t|x^{t-1}, \phi) \neq p_\epsilon(x_t|x_{t-n}^{t-1}, \phi)$ because the state order distribution of
the non-emitting model depends on the prefix $x^{t-n}$. This simple fact will allow
us to establish that there exists a non-emitting model that is not equivalent to
any Markov model.
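
The recursive form of the hierarchical non-emitting model can be evaluated directly on short strings. The sketch below is a naive transcription, with running time exponential in the string length, intended only to make the recursion concrete. The parameter functions `toy_delta` and `toy_lam` and the helper `total_mass` are made-up assumptions for a binary alphabet. The code also assumes that the empty string has probability 1, which is what ends the recursion.

```python
from itertools import product
from typing import Callable


def non_emitting_prob(y: str, state: str, n: int,
                      delta: Callable, lam: Callable) -> float:
    """Probability of emitting the string y starting from the given state."""
    if not y:
        return 1.0                       # assumed: nothing left to emit
    if len(state) > n:
        # lambda is 0 above order n: drop the oldest symbol without emitting.
        return non_emitting_prob(y, state[1:], n, delta, lam)
    emit_weight = 1.0 if state == "" else lam(state)   # lambda_0(empty) = 1
    total = 0.0
    if emit_weight > 0.0:
        # Emitting transition: emit y[0] and move to the successor state.
        total += (emit_weight * delta(y[0], state)
                  * non_emitting_prob(y[1:], state + y[0], n, delta, lam))
    if emit_weight < 1.0:
        # Non-emitting transition: move to the next lower order state.
        total += (1.0 - emit_weight) * non_emitting_prob(
            y, state[1:], n, delta, lam)
    return total


def toy_delta(y: str, state: str) -> float:
    """Made-up emission probabilities over the alphabet {'0', '1'}."""
    if state == "":
        return 0.5
    return 0.75 if y == state[-1] else 0.25


def toy_lam(state: str) -> float:
    """Made-up emission weights that depend only on the state order."""
    return {1: 0.6, 2: 0.3}[len(state)]


def total_mass(length: int, n: int) -> float:
    """Sum of the probabilities of all binary strings of a given length."""
    return sum(non_emitting_prob("".join(s), "", n, toy_delta, toy_lam)
               for s in product("01", repeat=length))
```

Summing over all strings of a fixed length should give 1, which is a convenient sanity check on any implementation of the recursion.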

Lemma 4.1 states that there exists a non-emitting model $\phi$ that cannot be
converted into an equivalent basic model of any order. There will always be
a string $x^T$ that distinguishes the non-emitting model $\phi$ from any given basic
model $\phi'$ because the non-emitting model can encode unbounded dependencies
in its state distribution.

Lemma 4.1 $\exists \phi\ \forall \phi'\ \exists x^T \in A^*\ [\, p_\epsilon(x^T|\phi, T) \neq p_m(x^T|\phi', T) \,]$

Proof. The idea of the proof is that our non-emitting model will encode the first
symbol $x_1$ of the string $x^T$ in its state distribution, for an unbounded distance.
This will allow it to predict the last symbol $x_T$ using its knowledge of the first
symbol $x_1$. The basic model will only be able to predict the last symbol $x_T$ using
the preceding $n$ symbols, and therefore when $T$ is greater than $n$, we can arrange
for $p_\epsilon(x^T|\phi, T)$ to differ from any $p_m(x^T|\phi', T)$, simply by our choice of $x_1$.

The smallest non-emitting model capable of exhibiting the required behavior
has order 2. Lower order non-emitting models are equivalent to interpolated
models of the same order, with the same parameters. The non-emitting transition
probabilities $\lambda$ and the interior of the string $x_2^{T-1}$ will be chosen so that the
non-emitting model is either in an order 2 state or an order 0 state, with no way
to transition from one to the other. The first symbol $x_1$ will determine whether
the non-emitting model goes to the order 2 state or stays in the order 0 state.
No matter what probability the basic model assigns to the final symbol $x_T$, the
non-emitting model can assign a different probability by the appropriate choice
of $x_1$, $\delta_0(x_T)$, and $\delta_2(x_T|x_{T-2}^{T-1})$.

Consider the second order non-emitting model over a binary alphabet with
$\lambda(0) = 1$, $\lambda(1) = 0$, and $\lambda(11) = 1$ on strings in $A1^*A$. When $x_1 = 0$, then
$x_2$ will be predicted using the 1st order model $\delta_1(x_2|x_1)$, and all subsequent $x_t$
will be predicted by the second order model $\delta_2(x_t|x_{t-2}^{t-1})$. When $x_1 = 1$, then all
subsequent $x_t$ will be predicted by the zeroth order model $\delta_0(x_t)$. Thus for all
$t > p$, $p_\epsilon(x_t|x^{t-1}) \neq p_\epsilon(x_t|x_{t-p}^{t-1})$ for any fixed $p$, and no basic model is equivalent
to this simple non-emitting model. □
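
The construction in the proof can be checked numerically. The sketch below hard-codes the three weights given above and adds two assumptions of its own: the weight for state 01 is set to 1 so that the model stays at order 2 after the second symbol, and all emission probabilities in `DELTA` are made-up values chosen so that the order 2 and order 0 predictions differ. The names `DELTA`, `LAM`, `string_prob`, and `last_symbol_prob` are illustrative.

```python
N = 2  # model order

# Emission weights: the three values from the proof, plus an assumed
# weight of 1 for state "01"; any unlisted state is treated as weight 0.
LAM = {"0": 1.0, "1": 0.0, "11": 1.0, "01": 1.0}

# Made-up emission probabilities for every state that can emit.
DELTA = {
    "":   {"0": 0.5, "1": 0.5},
    "0":  {"0": 0.5, "1": 0.5},
    "01": {"0": 0.5, "1": 0.5},
    "11": {"0": 0.1, "1": 0.9},
}


def string_prob(x: str, state: str = "") -> float:
    """Probability of the string x under the hierarchical recursion."""
    if not x:
        return 1.0
    if len(state) > N:
        return string_prob(x, state[1:])       # forced drop above order N
    weight = 1.0 if state == "" else LAM.get(state, 0.0)
    total = 0.0
    if weight > 0.0:
        total += weight * DELTA[state][x[0]] * string_prob(x[1:], state + x[0])
    if weight < 1.0:
        total += (1.0 - weight) * string_prob(x, state[1:])
    return total


def last_symbol_prob(x: str) -> float:
    """Conditional probability of the last symbol given the rest of x."""
    return string_prob(x) / string_prob(x[:-1])


if __name__ == "__main__":
    for length in (4, 8, 12):
        starts_with_0 = "0" + "1" * (length - 1)
        starts_with_1 = "1" * length
        print(length,
              last_symbol_prob(starts_with_0),
              last_symbol_prob(starts_with_1))
```

The two strings of each length differ only in their first symbol, yet the prediction for the final symbol differs no matter how long the run of 1s is. A basic Markov model of any fixed order sees identical contexts once the run is longer than its order.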

Every basic model is a non-emitting model, with the appropriate choice of 
non-emitting transition probabilities. 

Lemma 4.2 $\forall \phi\ \exists \phi'\ \forall x^T \in A^*\ [\, p_\epsilon(x^T|\phi', T) = p_m(x^T|\phi, T) \,]$

Proof. A basic model $\phi = (A, n, \delta_n)$ is equivalent to a non-emitting model
$\phi' = (A, n, \delta', \lambda')$ where $\delta'_n = \delta_n$ and $\lambda'(n|x^n) = 1$ for all $x^n$. In the hierarchical
parameterization, $\lambda'(x^n) = 1$ for all $x^n$. □

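Therefore, the class $\mathcal{P}_\epsilon$ of non-emitting Markov distributions is strictly more
powerful than the class $\mathcal{P}_m$ of basic Markov distributions.

Theorem 1 $\mathcal{P}_m \subset \mathcal{P}_\epsilon$

Proof. $\mathcal{P}_m \neq \mathcal{P}_\epsilon$ by lemma 4.1 and $\mathcal{P}_m \subseteq \mathcal{P}_\epsilon$ by lemma 4.2. □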

Since interpolated models and backoff models are equivalent to basic Markov 
models, we have as a corollary that non-emitting Markov models are strictly 
more powerful than interpolated and backoff models. Note that non-emitting 
Markov models are considerably less powerful than the full class of stochastic 
finite state automata because their states are Markovian. For the same reason, 
non-emitting models are also less powerful than the full class of hidden Markov 
models. 

Let us now turn to the algorithms required to evaluate the probability of 
a string according to a non-emitting mixture model and to optimize the non- 
emitting state transitions on a training corpus. 



5 Estimation 

Here we present an efficient expectation-maximization (EM) algorithm to op- 
timize the parameters of a hierarchical non-emitting mixture model on data. 
An EM algorithm iteratively maximizes the probability of the training data according
to the model by computing the expectation of model parameters on the
data and then updating the model parameters to maximize those expectations
[1].

The non-emitting mixture model is sufficiently expressive that any max- 
imum likelihood estimator will overfit its parameters to the training corpus. 
Unseen events will be assigned zero probability, and the overfit model will fail 
to accurately predict the future. The traditional solution to this problem for interpolated
Markov models is cross-estimation [12]. Cross-estimation repeatedly
partitions the training data into two blocks and optimizes the mixing parame- 
ters on one block after initializing the state transition parameters on the other 
block. We present a traditional cross-estimation algorithm for hierarchical non- 
emitting models. 

We begin by partitioning the training corpus into a fixed set of blocks B. 
Ideally our partition is linguistically meaningful and roughly uniform, but nei- 
ther condition is essential. For example, we might divide a natural language 
text corpus on sentence, paragraph, or article boundaries. Next we call
CROSS-ESTIMATE($B$, $\phi$) on our hierarchical non-emitting model $\phi$.

CROSS-ESTIMATE($B$, $\phi$)
1. Until convergence
2.     Initialize $\lambda^+$, $\lambda^-$ to zero;
3.     For each block $B_i$ in $B$
4.         Initialize $\delta$ using $B - B_i$;
5.         EXPECTATION-STEP($B_i$, $\phi$, $\lambda^+$, $\lambda^-$);
6.     MAXIMIZATION-STEP($\phi$, $\lambda^+$, $\lambda^-$);
7. Initialize $\delta$ using $B$;

The variables $\lambda^+(x^i)$ and $\lambda^-(x^i)$ accumulate expectations for the non-emitting
state transition parameter $\lambda(x^i)$. $\lambda^+(x^i)$ contains the expectation of emitting
a symbol in state $x^i$, conditioned on being in state $x^i$, while $\lambda^-(x^i)$ contains
the expectation of transitioning to $x_2^i$ without emitting a symbol, conditioned
on being in state $x^i$. Lines 3-5 enumerate all one-block partitions of the training
corpus. The emitting state transitions $\delta$ are initialized to their maximum
likelihood estimates on the larger block $B - B_i$ and then the non-emitting state
transitions $\lambda$ are optimized on the smaller "withheld" block $B_i$.
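
The control flow of cross-estimation can be sketched independently of the model. In the sketch below the three callbacks (`init_delta`, `expectation_step`, `maximization_step`) are placeholders for the procedures described in this section, their signatures are assumptions, and a fixed iteration count stands in for the convergence test.

```python
from typing import Callable, Iterator, List, Tuple


def one_block_partitions(blocks: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training blocks, withheld block) for each one-block partition."""
    for i, withheld in enumerate(blocks):
        yield blocks[:i] + blocks[i + 1:], withheld


def cross_estimate(blocks: List[str],
                   init_delta: Callable[[List[str]], None],
                   expectation_step: Callable[[str, dict, dict], None],
                   maximization_step: Callable[[dict, dict], None],
                   iterations: int = 10) -> None:
    """Skeleton of cross-estimation with pluggable model procedures."""
    for _ in range(iterations):              # stands in for "until convergence"
        acc_plus: dict = {}                  # expectations of emitting
        acc_minus: dict = {}                 # expectations of not emitting
        for training, withheld in one_block_partitions(blocks):
            init_delta(training)             # emitting parameters from B - B_i
            expectation_step(withheld, acc_plus, acc_minus)
        maximization_step(acc_plus, acc_minus)
    init_delta(list(blocks))                 # final emitting parameters from B
```

Accumulating over all withheld blocks before the maximization step means that every block contributes to the non-emitting parameters while never being scored by emitting parameters that were trained on it.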

The heart of the algorithm is the EXPECTATION-STEP() procedure, which
calculates the expectation of the non-emitting transitions on the string $x^b$ and
then increments the $\lambda^+$, $\lambda^-$ accumulators.

EXPECTATION-STEP($x^b$, $\phi$, $\lambda^+$, $\lambda^-$)
1. $\alpha$ := FORWARD($x^b$, $\phi$);
2. $\beta$ := BACKWARD($x^b$, $\phi$);
3. for $t = b$ downto 1
4.     for $i = 1$ upto $\min(n, t)$
5.         $\lambda^-_t(i)$ += $\alpha_t(i)(1 - \lambda_t(i))\beta_t(i-1)$;
6.         $\lambda^+_{t-1}(i-1)$ += $\alpha_{t-1}(i-1)\lambda_{t-1}(i-1)\delta_{t-1}(i-1)\beta_t(i)$;
7.     if ($t > n$) [ $\lambda^+_{t-1}(n)$ += $\alpha_{t-1}(n)\lambda_{t-1}(n)\delta_{t-1}(n)\beta_t(n)$; ]



The forward variable $\alpha_t(i)$ contains the probability $p(x^t, o_t = i|\phi)$ that the
model $\phi$ generated the prefix $x^t$ and terminated in the order $i$ state. The
backward variable $\beta_t(i)$ contains the probability $p(x_{t+1}^T|x^t, o_t = i, \phi)$ that the
model $\phi$ generated the suffix $x_{t+1}^T$ given that it was in the order $i$ state at time
$t$. To simplify the notation, we define $\lambda_t(i)$ to be the probability $\lambda(x_{t-i+1}^{t})$ of
emitting a symbol from the $i$th order state at time $t$, given that we are in that
state. We also define $\delta_t(i)$ to be the probability $\delta_i(x_{t+1}|x_{t-i+1}^{t})$ of the emitting
transition from state $x_{t-i+1}^{t}$ to state $x_{t-i+1}^{t+1}$.

The EXPECTATION-STEP() algorithm requires $O(nb)$ time and space for an
order $n$ non-emitting model on a string $x^b$ of length $b$. A comparable interpolated
model can take an expectation step in $O(nb)$ time and $O(1)$ space Q.
While the difference between $O(nb)$ and $O(1)$ space can be considerable, the
additional space requirements of the non-emitting algorithm are small when 
compared to the cost of storing all the model parameters. An order n mixture 
model has 0(nT) parameters for a training corpus of size T, and the training 
corpus is typically an order of magnitude larger than the withheld block. 

FORWARD($x^T$, $\phi$)
1. $\alpha_0(0)$ := 1;
2. for $t = 0$ upto $T - 1$
3.     for $i = \min(n - 1, t)$ downto 0
4.         $\alpha_t(i)$ += $\alpha_t(i+1)(1 - \lambda_t(i+1))$;
5.         $\alpha_{t+1}(i+1)$ := $\alpha_t(i)\lambda_t(i)\delta_t(i)$;
6.     if ($t \geq n$) [ $\alpha_{t+1}(n)$ += $\alpha_t(n)\lambda_t(n)\delta_t(n)$; ]
7. return($\alpha$);

BACKWARD($x^T$, $\phi$)
1. for $i = 0$ upto $\min(n - 1, T - 1)$
2.     $\beta_{T-1}(i)$ := $\lambda_{T-1}(i)\delta_{T-1}(i)$;
3. if ($T > n$) [ $\beta_{T-1}(n)$ := $\lambda_{T-1}(n)\delta_{T-1}(n)$; ]
4. for $t = T - 1$ downto 1
5.     for $i = 1$ upto $\min(n, t)$
6.         $\beta_t(i)$ += $(1 - \lambda_t(i))\beta_t(i-1)$;
7.         $\beta_{t-1}(i-1)$ := $\lambda_{t-1}(i-1)\delta_{t-1}(i-1)\beta_t(i)$;
8.     if ($t > n$) [ $\beta_{t-1}(n)$ := $\lambda_{t-1}(n)\delta_{t-1}(n)\beta_t(n)$; ]
9. return($\beta$);



The FORWARD() and BACKWARD() algorithms each require $O(nT)$ time and
space. It is possible to evaluate the probability $p_\epsilon(x^T|\phi)$ of a string $x^T$ according
to an order $n$ non-emitting model $\phi$ in $O(nT)$ time and $O(n)$ space. In contrast,
it is possible to evaluate the probability $p_c(x^T|\phi)$ according to an interpolated
model in $O(nT)$ time and $O(1)$ space. Again, the small additional cost in space
is negligible when compared to the cost of storing the model parameters.
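
The linear-space evaluation can be sketched by keeping only the current vector of forward values, one entry per state order. This is an illustrative implementation rather than a transcription of the report's pseudocode. It uses ordinary floating point, so it is only suitable for short strings, and `toy_delta` and `toy_lam` are made-up parameter functions over a binary alphabet.

```python
from typing import Callable


def forward_prob(x: str, n: int, delta: Callable, lam: Callable) -> float:
    """Probability of the string x, using O(n) working space."""
    alpha = [1.0] + [0.0] * n        # alpha[i]: mass in the order-i state
    for t in range(len(x)):
        top = min(n, t)              # highest order available after t symbols
        # Non-emitting transitions: pass un-emitted mass down one order
        # at a time, from the highest order to the lowest.
        for i in range(top, 0, -1):
            alpha[i - 1] += alpha[i] * (1.0 - lam(x[t - i:t]))
        # Emitting transitions: emit x[t] and move up one order (capped at n).
        new_alpha = [0.0] * (n + 1)
        for i in range(top + 1):
            state = x[t - i:t]
            weight = 1.0 if i == 0 else lam(state)   # order 0 always emits
            new_alpha[min(i + 1, n)] += alpha[i] * weight * delta(x[t], state)
        alpha = new_alpha
    return sum(alpha)


def toy_delta(y: str, state: str) -> float:
    """Made-up emission probabilities over the alphabet {'0', '1'}."""
    if state == "":
        return 0.5
    return 0.75 if y == state[-1] else 0.25


def toy_lam(state: str) -> float:
    """Made-up emission weights that depend only on the state order."""
    return {1: 0.6, 2: 0.3}[len(state)]
```

Each symbol costs two passes over at most n + 1 orders, which gives the stated linear time in nT, and only two vectors of length n + 1 are alive at any moment.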

Having done all the work in the expectation step, the maximization step is 
straightforward. 

MAXIMIZATION-STEP($\phi$, $\lambda^+$, $\lambda^-$)
1. Forall states $x^i$ in $A^{\leq n}$
2.     $\lambda(x^i)$ := $\lambda^+(x^i) / (\lambda^+(x^i) + \lambda^-(x^i))$;

Line 2 reestimates each non-emitting state transition parameter $\lambda(x^i)$ as the
expectation of emitting a symbol from that state divided by the expectation
of being in that state. In order to ensure that no non-emitting state transition
parameter $\lambda(x^i)$ is ever reestimated to 0 or 1, we typically initialize each
accumulator to a small positive number (e.g., 0.1) instead of zero.
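
The reestimation and the effect of the positive initial value can be sketched as follows. The dictionary representation and the names `new_accumulators` and `maximization_step` are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def new_accumulators(states: Iterable[str], floor: float = 0.1
                     ) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Create both accumulators, each starting at a small positive value."""
    states = list(states)
    return {s: floor for s in states}, {s: floor for s in states}


def maximization_step(acc_plus: Dict[str, float],
                      acc_minus: Dict[str, float]) -> Dict[str, float]:
    """Reestimate each weight as emitting mass over total mass in the state."""
    return {s: acc_plus[s] / (acc_plus[s] + acc_minus[s]) for s in acc_plus}
```

With both accumulators starting at the same positive value, a state that is never visited on the withheld data is reestimated to 0.5, and a state that was only ever seen emitting (or only ever seen not emitting) stays strictly inside the open interval between 0 and 1.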

When $\lambda$ parameters are tied, then their $\lambda^+$ and $\lambda^-$ expectations must be
pooled before they are updated. Let $\tau(x^i)$ be the equivalence class of $x^i$ under
the tying scheme $\tau$. For simplicity, imagine $\tau(x^i)$ to be an index. All algorithms
in this section would use the tied parameter $\lambda(\tau(x^i))$ instead of the untied
parameter $\lambda(x^i)$. The TIED-EXPECTATION-STEP() algorithm would increment
the $\lambda^+(\tau(x^i))$ and $\lambda^-(\tau(x^i))$ accumulators, and the TIED-MAXIMIZATION-STEP()
algorithm would be as follows.

TIED-MAXIMIZATION-STEP($\phi$, $\lambda^+$, $\lambda^-$, $\tau$)
1. Forall classes $i$ in $\tau(A^{\leq n})$
2.     $\lambda(i)$ := $\lambda^+(i) / (\lambda^+(i) + \lambda^-(i))$;

In some situations, cross-estimation may be approximated by forward-estimation.
Like cross-estimation, forward-estimation initializes the $\delta$ parameters on one
text block and optimizes the $\lambda$ parameters on another block. Forward-estimation
uses only a single text partition whereas cross-estimation uses all one-block
text partitions. As a result, forward-estimation is considerably faster than cross-estimation,
both in the amount of time required per iteration and in the number
of iterations until convergence. Unfortunately, it can lead to inferior results
when there are too many mixing parameters.

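FORWARD-ESTIMATE($B_\delta$, $B_\lambda$, $\phi$)
1. Until convergence
2.     Initialize $\lambda^+$, $\lambda^-$ to zero;
3.     Initialize $\delta$ using $B_\delta$;
4.     EXPECTATION-STEP($B_\lambda$, $\phi$, $\lambda^+$, $\lambda^-$);
5.     MAXIMIZATION-STEP($\phi$, $\lambda^+$, $\lambda^-$);
6. Initialize $\delta$ using $B_\delta \cup B_\lambda$;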

Implementation Note. Unless the corpus and the alphabet size are very
small, then the $\alpha_t(i)$ and $\beta_t(i)$ values used in the EXPECTATION-STEP() procedure
will exceed the representational range of double precision IEEE floating
point numbers. When this happens, a floating point exception will occur and an
alternate representation must be used for the probability values. The simplest
approach is to use a logarithmic representation. Multiplication and division of
probability values is straightforward in a logarithmic representation.

$$\log(x \cdot y) = \log(x) + \log(y)$$
$$\log(x / y) = \log(x) - \log(y)$$

Addition of logarithmic probability values is more costly, and care must be taken
to avoid underflow.

$$\log(x + y) = \begin{cases} \log(x) & \text{if } \log(y) - \log(x) < \Lambda \\ \log(x) + \log(1 + \exp(\log(y) - \log(x))) & \text{otherwise} \end{cases}$$

Here $\Lambda$ is the smallest representable exponent, for example, -707.7 for IEEE
double precision floating point numbers when the logarithms are natural (i.e.,
base e). This test is necessary to avoid underflow in the call to exp().

While it is simple to implement, logarithmic arithmetic can be 15-50 times
slower than straight probability arithmetic, depending on the speed of the floating
point unit and the math library provided with the operating system. For
this reason, our implementation used an extended exponent representation from
the library of practical abstractions [pi}. This balanced_t module provides single
precision floating point numbers with 32 bit exponents. It is 1.5 to 3.0 times
faster than the logarithmic representation, depending on the machine.

When computation time is at a premium, then the most effective solution is
to periodically scale the probability values in the $\alpha_t(i)$ and $\beta_t(i)$ arrays to keep
them in an acceptable range. Scaling is more difficult to implement than logarithmic
arithmetic or balanced_t arithmetic, and it is inherently nonmodular.

6 Empirical Results

The ultimate measure of a statistical model is its predictive performance in
the domain of interest. To take the true measure of non-emitting models for
natural language texts, we evaluate their performance as character models on
the Brown corpus Q and as word models on the Wall Street Journal. Our results
show that the non-emitting Markov model consistently gives better predictions
than the traditional interpolated Markov model under equivalent experimental
conditions. In all cases we compare non-emitting and interpolated models of
identical model orders, with the same number of parameters. Note that the
non-emitting bigram and the interpolated bigram are equivalent.

Corpus        Alphabet  Size        Blocks
Brown         90        6,004,032   21
WSJ 1989      20,293    6,219,350   22
WSJ 1987-89   20,092    42,373,513  152



All $\lambda$ values were initialized uniformly to 0.5 and then optimized using cross-estimation
on the first 90% of each corpus. The remaining 10% of each
corpus was used to evaluate model performance. While this validation paradigm
exposes the models to nonstationarity, it is simple to understand and easily 
reproduced. 

We consider a single parameter tying scheme, in which all states with the
same frequency and diversity are considered equivalent. The frequency $c(x^i)$ of
a state is the number of times that the string $x^i$ occurred in the training corpus.
The diversity $q(x^i) = |\{y : c(x^i y) > 0\}|$ of a state is the number of distinct
symbols observed in the state. Experience with multinomial prediction suggests
that frequency and diversity are necessary to accurately estimate the likelihood
of novel symbols [pH 
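
The frequency and diversity of every state can be collected in one pass per order. The sketch below is illustrative: the function name and the choice to return the (frequency, diversity) pair itself as the class label are assumptions, and the order 0 state is omitted for brevity.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple


def frequency_diversity_classes(corpus: str, n: int
                                ) -> Dict[str, Tuple[int, int]]:
    """Map each state of order 1..n to its (frequency, diversity) class."""
    freq: Counter = Counter()
    successors = defaultdict(set)
    for order in range(1, n + 1):
        for start in range(len(corpus) - order + 1):
            state = corpus[start:start + order]
            freq[state] += 1                       # occurrences of the state
            end = start + order
            if end < len(corpus):
                successors[state].add(corpus[end])  # distinct next symbols
    return {state: (freq[state], len(successors[state])) for state in freq}
```

States that share a label would then share one tied parameter. On the toy string "abracadabra", for instance, the states "b" and "r" both occur twice and are each followed by a single distinct symbol, so they fall into the same class.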



In related work [25], Thomas compares the performance of the interpolated
and non-emitting models on the Brown corpus and Wall Street Journal with ten
different parameter tying schemes. His experiments confirm that some parameter
tying schemes improve model performance, although to a lesser degree when
cross-estimation is used. The non-emitting model consistently outperformed
the interpolated model on both corpora for all ten parameter tying schemes.
Thomas shows that our frequency-diversity parameter tying scheme is one of
the more effective parameter tying schemes.



6.1 Brown Corpus 

Our first set of experiments was with character models on the Brown corpus
@. The Brown corpus is an eclectic collection of English prose, containing 
6,004,032 characters partitioned into 500 files. We performed 10 iterations of
cross-estimation on 21 blocks. Results are reported as per-character test message
entropies (bits/char), $-\frac{1}{v} \log_2 p(y^v|v)$. The non-emitting model outperforms
the interpolated model for all nontrivial model orders, particularly for larger 
model orders. The non-emitting model is considerably less prone to overfitting. 
After 10 EM iterations, the untied order 9 non-emitting model scores 1.996 
bits/char while the untied order 9 interpolated model scores 2.334 bits/char. 
The untied non-emitting model even outperforms the tied interpolated model 
for all nontrivial model orders. 

Model    Interpolation       Non-Emitting
order    untied    tied      untied    tied
1        3.602     3.602     3.602     3.602
2        2.950     2.950     2.946     2.946
3        2.490     2.486     2.473     2.473
4        2.231     2.218     2.193     2.192
5        2.149     2.112     2.076     2.075
6        2.164     2.082     2.031     2.027
7        2.212     2.077     2.015     2.008
8        2.277     2.084     2.010     2.000
9        2.334     2.093     2.009     1.996
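
The entropies in this section are averages of negative base 2 log probabilities over the test message. A minimal sketch of the computation, assuming the model's conditional probability for each test symbol has already been collected into a list (the function name is illustrative):

```python
import math
from typing import Sequence


def bits_per_symbol(cond_probs: Sequence[float]) -> float:
    """Average number of bits per symbol for a test message.

    cond_probs[t] is the probability the model assigned to the t-th test
    symbol given its history; their product is the message probability.
    """
    return -sum(math.log2(p) for p in cond_probs) / len(cond_probs)
```

Lower is better. A score of 2 bits/char corresponds to assigning each test character a probability of 1/4 in the geometric mean.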



We also compared the performance of our techniques with two new interpolation
schemes recently proposed by Potamianos and Jelinek [16]. Their DI-TD
scheme uses hierarchical state-conditional interpolation $\lambda(x^i)$, variable-width
frequency × order parameter tying, and "top-down optimization" on one withheld
block. Their DI-BU scheme uses general state-conditional interpolation
$\lambda(i|x^n)$, variable-width frequency × order parameter tying, and bottom-up optimization
on one withheld block. The comparison is performed on a modified
version of the Brown corpus, which they provided to us. This modified corpus
eliminates the unusual punctuation of the original Brown corpus, reduces the
alphabet size from 90 to 79, and separates distinct linguistic tokens with single
spaces.



Corpus        Alphabet  Size        Train       Test      Blocks
Brown (std)   90        6,004,032   5,403,629   600,403   21
Brown (JHU)   79        6,093,662   5,607,270   486,392   21

Another difference between the Potamianos-Jelinek validation paradigm and
ours lies in how the corpus is partitioned into training and testing blocks. In
our experiments, the test block was the last 10% of the Brown corpus - the last
428 characters from br-n14.txt plus all files from br-n15.txt through br-r09.txt
inclusive. In the Potamianos-Jelinek experiments, the test block consisted of
complete sentences chosen uniformly from the entire (modified) Brown corpus.

To this comparison, we added the original interpolation schemes of Jelinek
and Mercer [12] under 10 iterations of forward-estimation (DI-FE) and
cross-estimation (DI-CE). Both models used hierarchical state-conditional interpolation
$\lambda(x^i)$ and straight frequency × diversity parameter tying. We also
added the hierarchical non-emitting model with straight frequency × diversity
parameter tying, and 10 iterations of forward-estimation (NE-FE) and cross-estimation
(NE-CE). The results are summarized in the following table as
mean test message entropies (bits/char).
Model   Interpolation                       Non-Emitting
order   DI-TD    DI-BU    DI-FE    DI-CE    NE-FE    NE-CE
1       3.470    3.470    3.478    3.478    3.478    3.478
2       2.851    2.850    2.860    2.858    2.857    2.856
3       2.328    2.326    2.337    2.331    2.328    2.324
4       2.016    2.007    2.012    2.007    1.996    1.991
5       1.894    1.878    1.872    1.867    1.849    1.843
6       1.853    1.831    1.820    1.815    1.789    1.782
7       1.837    1.811    1.804    1.800    1.761    1.754
8       1.828    1.801    1.800    1.796    1.746    1.739
9       1.824    1.796    1.802    1.798    1.738    1.730



The non-emitting model consistently outperforms all interpolation schemes at
all model orders above 2, by a significant margin. The original Jelinek-Mercer
interpolation scheme also tends to outperform the two new DI-TD and DI-BU
schemes at higher model orders, for both forward-estimation (DI-FE) and
cross-estimation (DI-CE).

Note also that the best order 9 result in the Potamianos-Jelinek paradigm 
(1.730 bits/char) is considerably better than the best order 9 result in our val- 
idation paradigm (1.996 bits/char). We believe this is partially attributable 
to the reduced alphabet size of the modified corpus, and principally due to 
the difference in the two train-test partitions. The prediction problem posed 
by our paradigm is more difficult because the last 10% of the Brown files are
appreciably different from the first 90% of the files.



6.2 WSJ 1989 

The second set of experiments was on the 1989 Wall Street Journal corpus, which
contains 6,219,350 words. Our vocabulary consisted of the 20,293 words that
occurred at least 10 times in the entire WSJ 1989 corpus. All out-of-vocabulary
words were mapped to a unique OOV symbol. We performed 10 iterations of
cross estimation on 22 blocks. Following standard practice in the speech
recognition community, results are reported as per-word test message perplexities
$p(y^v \mid v)^{-1/v}$. The perplexity represents the effective alphabet size.
Again, the non-emitting model outperforms the interpolated model for all
nontrivial model orders, even without parameter tying.
Model   Interpolation      Non-Emitting
order   untied   tied      untied   tied
1       175.2    174.9     175.2    174.9
2       123.7    122.8     119.6    119.0
3       121.3    119.0     111.9    111.1
4       123.0    117.2     110.6    109.5
5       124.5    116.3     110.4    109.0
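The relation between the per-word perplexities of this section and the
bits/char entropies of section 6.1 can be made concrete with a minimal sketch.
The function name and the four conditional probabilities below are hypothetical
and serve only to illustrate the arithmetic: the perplexity is two raised to the
mean number of bits per predicted symbol.

```python
import math


def perplexity(log2_probs):
    """Per-symbol perplexity p(y^v)^(-1/v) from per-symbol log2 probabilities."""
    v = len(log2_probs)
    if v == 0:
        raise ValueError("empty test message")
    entropy_bits = -sum(log2_probs) / v  # mean test message entropy (bits/symbol)
    return 2.0 ** entropy_bits


# Hypothetical conditional probabilities assigned to a four-word test message.
probs = [0.25, 0.125, 0.5, 0.0625]
ppl = perplexity([math.log2(p) for p in probs])

# A uniform model over 8 symbols has perplexity 8: the "effective alphabet size".
uniform_ppl = perplexity([math.log2(1.0 / 8.0)] * 5)
```

Under this reading, a perplexity of 109.0 corresponds to roughly 6.77 bits per
word.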





6.3 WSJ 1987-89 

The third set of experiments was on the 1987-89 Wall Street Journal corpus, 
which contains 42,373,513 words. Our vocabulary consisted of the 20,092 words 
that occurred at least 63 times in the entire WSJ 1987-89 corpus. Again, all 
out-of-vocabulary words were mapped to a unique OOV symbol. We performed 
10 iterations of cross estimation on 152 blocks. Results are reported as test 
message perplexities. As with the WSJ 1989 corpus, the non-emitting model 
outperforms the interpolated model for all nontrivial model orders, even without 
parameter tying. 



Model   Interpolation      Non-Emitting
order   untied   tied      untied   tied
1       150.7    150.7     150.7    150.7
2       94.0     93.9      92.1     92.1
3       89.2     88.6      83.2     83.2



6.4 Posthoc Analysis 

In order to understand the striking empirical advantage of the non-emitting 
model over the interpolated model, we conducted the following experiment. We 
induced order 9 interpolated and non-emitting models from the Brown cor- 
pus using forward estimation with no parameter tying. This configuration was 
chosen to maximize the performance difference between the two models. The re- 
sulting interpolated model predicts the Brown test corpus with 2.4480 bits/char 
while the resulting non-emitting model predicts the Brown test corpus with 
2.1536 bits/char. 

The following table shows the mean state order occupancy statistics for the 
two models on the Brown corpus. 
Order   Interpolated   Non-Emitting
9       0.133          0.070
8       0.120          0.090
7       0.131          0.127
6       0.147          0.170
5       0.147          0.195
4       0.130          0.173
3       0.095          0.108
2       0.058          0.047
1       0.027          0.013
0       0.011          0.003
mean    5.639          5.357



As might be expected, the interpolated model spends more time than the non- 
emitting model in the higher order states (orders 7-9). It is arguably more 
surprising, however, that the interpolated model also spends more time in the 
lower order states (orders 0-2). 
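The occupancy statistics can be checked with a few lines of arithmetic. The
sketch below transcribes the two columns of the table, assuming the rows run
from order 9 down to order 0, and recomputes the mean state order together with
the probability mass in the high (7-9), middle (3-6), and low (0-2) order
states. The helper names are ours.

```python
ORDERS = list(range(9, -1, -1))  # rows of the table: order 9 down to order 0

INTERPOLATED = [0.133, 0.120, 0.131, 0.147, 0.147,
                0.130, 0.095, 0.058, 0.027, 0.011]
NON_EMITTING = [0.070, 0.090, 0.127, 0.170, 0.195,
                0.173, 0.108, 0.047, 0.013, 0.003]


def mean_order(occupancy):
    """Expected state order under the given occupancy distribution."""
    return sum(k * p for k, p in zip(ORDERS, occupancy))


def mass(occupancy, lo, hi):
    """Total occupancy of the states whose order lies in [lo, hi]."""
    return sum(p for k, p in zip(ORDERS, occupancy) if lo <= k <= hi)


for name, occ in [("interpolated", INTERPOLATED), ("non-emitting", NON_EMITTING)]:
    print(name,
          "mean=%.3f" % mean_order(occ),
          "high=%.3f" % mass(occ, 7, 9),
          "mid=%.3f" % mass(occ, 3, 6),
          "low=%.3f" % mass(occ, 0, 2))
```

The recomputed means agree with the last row of the table, and the
non-emitting model concentrates its mass in the middle orders.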

One point where the non-emitting model outperforms the interpolated model
is in predicting the space ␣ that follows the string ,␣but␣now␣Keith in the
Brown test corpus. Unfortunately, the string ␣Keith does not occur in the
training corpus. Nonetheless, the non-emitting model assigns 209 times more
probability than the interpolated model to the event that a space will follow
the string ␣Keith. According to the non-emitting model, a space will follow the
string ␣Keith with probability 0.627. The interpolated model assigns
probability 0.003 to the same event.

The reason is somewhat subtle. On the training corpus, the string eith
is followed by the letter e with near certainty (0.9973). As a result, $\lambda$(eith)
approaches unity in both the interpolated and non-emitting models. Since the
model order 9 is sufficiently high, the interpolated model will use the eith state
whenever it occurs and no higher order state is preferred (see figure 1).

The hierarchical non-emitting model has no such freedom (see figure 2).
In order to reach the eith state, it must accurately predict every symbol in
the string eith. Otherwise, it will be forced to a lower order state along the
way. The transition to a lower order state occurs when the non-emitting model
attempts to predict the symbol t from the state ei. Since ei is rarely followed
by t in the training corpus (0.0761), the non-emitting model is forced into the
lower order state i, from which it is able to predict the symbol t with greater
probability (0.1172). As a result, the non-emitting model is never able to reach
the eith state. Instead, it must predict the space ␣ after ␣Keith using the state
ith. This works quite well because ith is followed by ␣ with high probability
in the training corpus (0.6136).
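The mechanism can be illustrated on a toy model. The sketch below is not the
implementation used in our experiments. It assumes the hierarchical recursion in
which a state of order k either emits a symbol with weight $\lambda$ and moves to
the next higher order, or takes a non-emitting transition to its maximal proper
suffix with weight $1 - \lambda$. The interpolated model is obtained by
restarting from the highest order state at every time step, while the
non-emitting model carries its state order distribution forward. All parameter
values and names are hypothetical.

```python
def suffix(history, k):
    """Last k symbols of history (the empty string when k == 0)."""
    return history[len(history) - k:] if k else ""


def step(alpha, history, symbol, lam, delta, n):
    """Advance the state order distribution alpha over one emitted symbol.

    alpha[k] is the joint probability of the symbols seen so far and of
    occupying the order-k state, i.e. the length-k suffix of history.
    """
    work = list(alpha)
    new = [0.0] * (n + 1)
    for k in range(n, -1, -1):
        if k > len(history):
            continue  # no state of this order exists yet
        ctx = suffix(history, k)
        stay = 1.0 if k == 0 else lam.get(ctx, 0.0)
        if k > 0:
            work[k - 1] += work[k] * (1.0 - stay)  # non-emitting transition
        emit = work[k] * stay * delta.get(ctx, {}).get(symbol, 0.0)
        new[min(k + 1, n)] += emit  # emitting transition raises the order
    return new


def interpolated_prob(history, symbol, lam, delta, n):
    """Interpolated model: restart from the highest order state each step."""
    alpha = [0.0] * (n + 1)
    alpha[min(n, len(history))] = 1.0
    return sum(step(alpha, history, symbol, lam, delta, n))


def non_emitting_probs(start, text, lam, delta, n):
    """Non-emitting model: the state order distribution persists over time."""
    alpha = [0.0] * (n + 1)
    alpha[min(n, len(start))] = 1.0
    history, out = start, []
    for y in text:
        new = step(alpha, history, y, lam, delta, n)
        out.append(sum(new) / sum(alpha))
        alpha, history = new, history + y
    return out


# Hypothetical order-2 model over {a, b}: "ab" is a confident state that
# predicts "b", but "b" is itself poorly predicted from the state "a".
N = 2
LAM = {"a": 0.8, "b": 0.5, "ab": 0.99}
DELTA = {"": {"a": 0.5, "b": 0.5},
         "a": {"a": 0.9, "b": 0.1},
         "b": {"a": 0.5, "b": 0.5},
         "ab": {"a": 0.01, "b": 0.99}}

p_interp = interpolated_prob("ab", "a", LAM, DELTA, N)
p_ne = non_emitting_probs("a", "ba", LAM, DELTA, N)[-1]
print("interpolated: %.4f  non-emitting: %.4f" % (p_interp, p_ne))
```

In this toy example the interpolated model commits to the state ab, while the
non-emitting model retains substantial mass in the order 1 state because the
symbol b was poorly predicted from the state a. This parallels the eith and ith
states above.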
Order     ␣       K       e       i       t       h       ␣
9       0.549   0.446     -       -       -       -       -
8       0.275   0.223     -       -       -       -       -
7       0.137   0.049     -       -       -       -       -
6       0.037   0.082     -       -       -       -       -
5       0.001   0.038     -       -       -       -       -
4         -     0.067     -       -       -       -     1.000
3         -     0.050     -     0.376     -       -       -
2         -     0.025   0.856   0.617   0.527   0.596     -
1         -     0.015   0.142   0.004   0.415   0.297     -
0         -     0.004   0.002   0.004   0.058   0.106     -
prob.   0.998   0.000   0.273   0.006   0.093   0.226   0.003

Figure 1: State occupancy probabilities for the order 9 interpolated model on
part of the Brown test corpus (2.4480 bits/char). The horizontal axis represents
the position in the test string and the vertical axis represents the hidden state
order. The bottom row shows the conditional probability of the symbol,
given the hidden state distribution. Thus the interpolated model is in the order
4 state eith with probability at least 0.9995 when predicting the final symbol,
and it assigns probability 0.003 to this symbol.



Order     ␣       K       e       i       t       h       ␣
9       0.166   0.121     -       -       -       -       -
8       0.289   0.369     -       -       -       -       -
7       0.340   0.449     -       -       -       -       -
6       0.161   0.016     -       -       -       -       -
5       0.035   0.017     -       -       -       -       -
4       0.008   0.017     -       -       -       -       -
3         -     0.008     -     0.472     -       -     0.734
2         -     0.002   0.939   0.527   0.848   0.749   0.187
1         -     0.001   0.061   0.001   0.118   0.207   0.003
0         -       -       -       -     0.033   0.044   0.075
prob.   0.977   0.000   0.277   0.006   0.081   0.232   0.627

Figure 2: State occupancy probabilities for the order 9 non-emitting model on
part of the Brown test corpus (2.1536 bits/char). The horizontal axis represents
the position in the test string and the vertical axis represents the hidden state
order. The bottom row shows the conditional probability of the symbol,
given the hidden state distribution. Thus the non-emitting model is in the
order 3 state ith with probability 0.734 when predicting the final symbol, and
it assigns probability 0.627 to this symbol.
6.5 Posterior Tying

This posthoc analysis led John Lafferty (personal communication) to suggest 
that the interpolated model might be able to approximate the empirical per- 
formance of the non-emitting model with a suitable parameter tying scheme. 
According to the non-emitting model, two states should be considered equiva- 
lent if they are equally effective at predicting the future and they are equally 
well predicted by the model. A state is well-predicted if the string that it rep- 
resents is assigned high probability, relative to the other states available at the 
time. A state provides strong predictions if the entropy of its emitting state 
transition probabilities is low. 

The most effective way for the interpolated model to mimic the non-emitting
model is to tie its states based on their expectations in the corresponding
non-emitting model. In order to avoid implementing the non-emitting model, we may
reasonably impose a uniform distribution on the non-emitting state transitions.
And in order to avoid running the full EXPECTATION-STEP() algorithm, we may
approximate the non-emitting state expectations by their forward expectations
in $O(nT)$ time and $O(n)$ space.

A further simplification is to use the mean empirical posterior probability.
The mean empirical posterior of a state is the empirical expectation
$\delta[x^i]$ of the state divided by its frequency $c(x^i)$. The empirical
expectation $\delta[x^i \mid y^T]$ of an $i$th order state $x^i$ in an order $n$
mixture Markov model with respect to a string $y^T$ is computed as follows

$$\delta[x^i \mid y^T] = \sum_{\{t \,:\, x^i = y^t_{t+1-i}\}} \delta(o_t = i \mid y^T)$$

with the empirical posterior 

Note that $\delta[x^i \mid y^T]$ may be calculated for all states in $O(nT)$ time
using dynamic programming. The empirical posterior $\delta(o_t = i \mid y^T)$ of
the $i$th order state at time $t$ could also be weighted by its predictive
success $-\log \delta(y_{t+1} \mid y^t_{t+1-i})$. A further refinement is to
compute the mean empirical posterior on withheld data.

As a final step, these values must be quantized to a finite number of levels 
to construct the parameter tying scheme. 
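The final quantization step can be sketched as follows. The report does not
specify the quantizer, so the uniform quantization of the unit interval, the
function name, and the toy expectations and frequencies below are all
assumptions made for illustration. States that fall into the same level would
share a single tied interpolation parameter.

```python
from collections import defaultdict


def tie_by_mean_posterior(expectation, frequency, levels=4):
    """Group states by their quantized mean empirical posterior.

    expectation[state] is the empirical expectation of the state,
    frequency[state] is its frequency c(state), and the mean empirical
    posterior is their ratio. Uniform quantization is an assumption.
    """
    classes = defaultdict(list)
    for state, count in frequency.items():
        if count <= 0:
            continue  # states never observed carry no posterior information
        mean_post = expectation.get(state, 0.0) / count
        mean_post = min(max(mean_post, 0.0), 1.0)  # guard against rounding
        level = min(int(mean_post * levels), levels - 1)
        classes[level].append(state)
    return dict(classes)


# Hypothetical expectations and frequencies for four states.
expectation = {"eith": 0.3, "ith": 9.0, "th": 25.0, "he": 95.0}
frequency = {"eith": 10, "ith": 12, "th": 50, "he": 100}
classes = tie_by_mean_posterior(expectation, frequency, levels=4)
```

With these toy values, the rarely occupied state eith is tied separately from
the states ith and he, which are occupied most of the time that they occur.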

7 Conclusion 

In this report, we propose a time series model that combines Markovian events 
of varying orders using stochastic non-emitting transitions. We prove that the 
resulting class of non-emitting Markov models is strictly more powerful than 
the class of Markov models, including interpolated and backoff models. More
importantly, our empirical investigation reveals that the non-emitting model 
consistently outperforms the strongest interpolated Markov models on natural 
language texts, with only a modest increase in computational requirements. 

The expressive power of the non-emitting model comes from its ability to 
represent additional information in its state order distribution. To prove that 
the non-emitting model was strictly more powerful than any Markov model, we 
used the state order distribution to represent an unbounded dependency. In 
our posthoc analysis, we revealed how the model uses its hidden state order 
distribution to remember the short-term effectiveness of all available Markovian 
states. 

The non-emitting model succeeds empirically because it imposes a pseudo- 
Bayesian discipline on maximum likelihood techniques. The interpolated model 
will favor a high-order state if it provides strong predictions on withheld data. 
The non-emitting model will favor a high-order state if the state provides strong 
predictions on withheld data and it is well-predicted by the model. In order to 
reach a high order state, the non-emitting model must assign high probability 
to each symbol in that state. Otherwise, the non-emitting model will be forced 
to transition to a lower order state at a previous time step and will not be able 
to reach the high order state. Thus, the state occupancies of the non-emitting 
model are influenced as much by their prior probabilities (pseudo-Bayes) as their 
past ability to predict the future (maximum likelihood). 

Finally, we note that the use of non-emitting transitions is a general modeling
technique that may be employed in any time series model, for both symbolic
and continuous domains.
A Backoff



The backoff model is arguably the most widely used statistical language model,
due in large part to its ease of implementation, computational efficiency,
reasonable performance at lower model orders, and an influential paper [13]. Backoff
models are also widely used in the data compression community, in large part
due to their computational efficiency. Here we review the backoff model,
establish the equivalence of backoff models and basic Markov models, and then
specify a class of non-emitting backoff models that is strictly more powerful than
the class of traditional backoff models.

In a backoff model, event probabilities are combined according to a partial 
order. Typically, higher order events are preferred over lower order events. The 
event probabilities are rescaled as we move through the partial order so that the 
derived probability function is valid. The efficacy of the backoff model depends 
on the events that are included in the model, their individual probabilities, and 
the order in which they are combined. 

Formally, a hierarchical backoff model $\theta = (A, E, \delta)$ consists of an
alphabet $A$, a dictionary $E$ of selected state transitions, $E \subseteq A^* \times A$,
and the state transition probabilities $\delta : E \rightarrow [0,1]$. The state
transition probabilities $\delta$ are extended to an unbounded domain by selecting
the maximal suffix of the history that appears with the relevant symbol in the
dictionary $E$ of state transitions.

$$p_b(y \mid x^i, \theta) = \begin{cases} \delta(y \mid x^i) & \text{if } x^i y \in E \\ \eta(x^i)\, p_b(y \mid x_2^i, \theta) & \text{otherwise} \end{cases}$$

where $\eta(x^i)$ rescales the conditional probability distribution as we back off
from higher order events to lower order events

$$\eta(x^i) = \frac{1 - \delta(E(x^i) \mid x^i)}{1 - p_b(E(x^i) \mid x_2^i)}$$

and $E(x^i)$ is the set of symbols available in the context $x^i$.

$$E(x^i) = \{ y : x^i y \in E \}$$

The rescalar $\eta(x^i)$ is computed directly from the transition probabilities
$\delta(\cdot \mid x^i)$ in conjunction with the transition dictionary $E$. It is
not a free parameter.

A hierarchical backoff model is a valid probability model if the dictionary $E$
includes every zeroth order state transition, $\{\epsilon\} \times A \subseteq E$,
and every $\delta(E(x^i) \mid x^i)$ is a valid probability function. A backoff
model is nonzero for all strings $A^*$ if every $\delta(y \mid x^i)$ is nonzero and
no $\delta(E(x^i) \mid x^i)$ is unity when $E(x^i) \subset A$.

In order to induce a hierarchical backoff model from data, we must select
the state transition dictionary and estimate its probabilities. One simple but
highly effective selection technique is to include every state transition whose
frequency exceeds a fixed threshold, which may depend on the state order. More
effective selection techniques require significant computational resources.
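A minimal sketch of the backoff computation follows, assuming the usual
recursive form in which a symbol is predicted from the longest suffix of the
history that appears with it in the dictionary. The dictionary is represented
as a Python mapping from (context, symbol) pairs to probabilities. That
representation, the function names, and the toy values are assumptions. The
rescalar is computed from the dictionary and is never stored.

```python
def available(E, ctx):
    """E(x): the symbols that appear with context ctx in the dictionary."""
    return [y for (c, y) in E if c == ctx]


def eta(ctx, E):
    """Rescalar for backing off from ctx to its maximal proper suffix."""
    symbols = available(E, ctx)
    kept = sum(E[(ctx, y)] for y in symbols)                  # delta(E(x) | x)
    lower = sum(p_backoff(y, ctx[1:], E) for y in symbols)    # p_b(E(x) | x_2)
    return (1.0 - kept) / (1.0 - lower)


def p_backoff(y, ctx, E):
    """Backoff probability of symbol y in context ctx."""
    if (ctx, y) in E:
        return E[(ctx, y)]
    if not ctx:
        raise KeyError("dictionary lacks the zeroth order transition %r" % y)
    return eta(ctx, E) * p_backoff(y, ctx[1:], E)


# Hypothetical dictionary over the alphabet {a, b}. The order 1 entry for
# (a, a) is discounted, leaving probability 0.5 for the backoff distribution.
E = {("", "a"): 0.6, ("", "b"): 0.4, ("a", "a"): 0.5}

p_bb = p_backoff("b", "a", E)    # backs off: eta("a") * delta(b | "")
p_long = p_backoff("a", "ba", E) # "ba" is unseen, so eta("ba") = 1
```

Because the rescalar redistributes exactly the mass left over by the
dictionary entries, the probabilities in every context sum to one.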
The state transition probabilities $\delta(y \mid x^i)$ are typically assigned by
multinomial estimates, either as conditional events $y \mid x^i$ in the symbol
alphabet $A$ or as joint events $x^i y$ in the string alphabet $A^{i+1}$. The most
widely used multinomial estimates for statistical language modeling employ some
form of discounting
[11], although other estimators have also been shown to be effective
[18], [21].

A valid backoff model $\theta$ whose event dictionary $E$ is a subset of $A^{n+1}$
can be converted into an equivalent basic Markov model $\phi'$ of order $n$, simply
by setting $\delta'_n(x_t \mid x^{t-1}_{t-n})$ equal to
$p_b(x_t \mid x^{t-1}_{t-n}, \theta)$. Every basic model is a backoff model
with a complete state transition dictionary. Consequently, the class of backoff
models is extensionally equivalent to the class of basic Markov models.

The hierarchical non-emitting backoff model $\theta = (A, E, \delta)$ has the same
parameterization as the traditional backoff model. Unlike the traditional model,
the backoff from the state $x^i$ to its maximal proper suffix $x_2^i$ is permanent
in the non-emitting backoff model.

$$p_\epsilon(y^j \mid x^i, \theta) = \begin{cases} \delta(y_1 \mid x^i)\, p_\epsilon(y_2^j \mid x^i y_1, \theta) & \text{if } x^i y_1 \in E \\ \eta(x^i)\, p_\epsilon(y^j \mid x_2^i, \theta) & \text{otherwise} \end{cases}$$

The rescalar $\eta(x^i)$ is identical in both versions of the backoff model.

The class of non-emitting backoff models is strictly more powerful than the
class of basic Markov models, by an argument similar to that of lemma 4.1.
Although the backoff model does not have any mixing parameters, we may use the
presence or absence of a state transition $y \mid x^i$ in the dictionary $E$ to
control the hidden state order. Conversely, every order $n$ backoff model can be
converted into an equivalent non-emitting backoff model with a complete state
transition dictionary $E = A^{n+1}$. Therefore, the class of non-emitting backoff
models is strictly more powerful than the class of simple backoff models.
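The effect of permanent backoff can be seen on a toy dictionary. The sketch
below is illustrative only. It assumes that the traditional model restarts from
the full history at every symbol, whereas the non-emitting model continues from
whatever shortened context it last backed off to. The dictionary representation,
function names, and probabilities are hypothetical.

```python
def eta(ctx, E):
    """Rescalar shared by both versions of the backoff model."""
    symbols = [y for (c, y) in E if c == ctx]
    kept = sum(E[(ctx, y)] for y in symbols)
    lower = sum(p_backoff(y, ctx[1:], E) for y in symbols)
    return (1.0 - kept) / (1.0 - lower)


def p_backoff(y, ctx, E):
    """Traditional backoff probability of one symbol."""
    if (ctx, y) in E:
        return E[(ctx, y)]
    if not ctx:
        raise KeyError("missing zeroth order transition %r" % y)
    return eta(ctx, E) * p_backoff(y, ctx[1:], E)


def traditional(string, ctx, E):
    """String probability when each symbol restarts from the full history."""
    prob = 1.0
    for y in string:
        prob *= p_backoff(y, ctx, E)
        ctx += y
    return prob


def non_emitting(string, ctx, E):
    """String probability when each backoff permanently shortens the state."""
    prob = 1.0
    for y in string:
        while (ctx, y) not in E:
            if not ctx:
                raise KeyError("missing zeroth order transition %r" % y)
            prob *= eta(ctx, E)
            ctx = ctx[1:]  # the lost context is never recovered
        prob *= E[(ctx, y)]
        ctx += y
    return prob


# Hypothetical dictionary over {a, b}; the transition (b, a) is absent.
E = {("", "a"): 0.6, ("", "b"): 0.4, ("a", "a"): 0.5, ("ba", "a"): 0.9}

p_trad = traditional("aa", "b", E)  # second symbol uses the state "ba"
p_ne = non_emitting("aa", "b", E)   # state "ba" is unreachable after backoff
```

Here the traditional model reaches the order 2 state ba for the second symbol,
while the non-emitting model, having backed off to predict the first symbol, can
only reach the order 1 state a. Both assignments remain normalized over strings
of a given length.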